Clustering

When establishing relationships between variables across blockgroups, not all blockgroups may exhibit the same relationships. Clustering can help us identify distinct underlying behaviors. This document introduces clustering as a tool for exactly that.

Load social distancing data and blockgroups

Load the Safegraph social distancing data, San Jose blockgroups, and Census data of interest.

pcloud <- '~/pCloud Drive/'

# get San Jose block groups
scc_blockgroups <- block_groups("CA","Santa Clara", cb=F, progress_bar=F)

# Use tracts sent to us by San Jose
sj_tracts <- st_read(paste0(pcloud,"SFBI/Data Library/San_Jose/CSJ_Census_Tracts/CSJ_Census_Tracts.shp")) %>%
  st_as_sf() %>%
  st_transform(st_crs(scc_blockgroups))
## Reading layer `CSJ_Census_Tracts' from data source `/Users/indra/pCloud Drive/SFBI/Data Library/San_Jose/CSJ_Census_Tracts/CSJ_Census_Tracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 219 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 6112856 ymin: 1869687 xmax: 6255982 ymax: 1996555
## epsg (SRID):    2227
## proj4string:    +proj=lcc +lat_1=38.43333333333333 +lat_2=37.06666666666667 +lat_0=36.5 +lon_0=-120.5 +x_0=2000000.0001016 +y_0=500000.0001016001 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
sj_citycouncil_disticts <- st_read(paste0(pcloud,"SFBI/Data Library/San_Jose/City Council Districts/CITY_COUNCIL_DISTRICTS.shp")) %>%
  st_as_sf() %>%
  st_transform(st_crs(scc_blockgroups))
## Reading layer `CITY_COUNCIL_DISTRICTS' from data source `/Users/indra/pCloud Drive/SFBI/Data Library/San_Jose/City Council Districts/CITY_COUNCIL_DISTRICTS.shp' using driver `ESRI Shapefile'
## Simple feature collection with 10 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 6112856 ymin: 1869687 xmax: 6255982 ymax: 1996555
## epsg (SRID):    2227
## proj4string:    +proj=lcc +lat_1=38.43333333333333 +lat_2=37.06666666666667 +lat_0=36.5 +lon_0=-120.5 +x_0=2000000.0001016 +y_0=500000.0001016001 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
# from code written by others to get SJ blockgroups
sj_blockgroups <-
  scc_blockgroups %>%
  st_centroid() %>%
  st_join(sj_tracts, left = F) %>%
  st_join(sj_citycouncil_disticts%>% dplyr::select(DISTRICTS)) %>%
  mutate(
    DISTRICTS = DISTRICTS %>% factor(levels = c("1","2","3","4","5","6","7","8","9","10"))
  ) %>%
  st_set_geometry(NULL) %>%
  left_join(scc_blockgroups%>% dplyr::select(GEOID), by = "GEOID") %>%
  st_as_sf() %>%
  dplyr::select(GEOID, DISTRICTS)

# the spatial join leaves out two blockgroups that touch district 9; assign those to district 9
sj_blockgroups$DISTRICTS[is.na(sj_blockgroups$DISTRICTS)] <- 9

# code from others in the class to get social distancing data 
sj_socialdistancing <- readRDS(paste0(pcloud,'SFBI/Restricted Data Library/Safegraph/covid19analysis/sj_socialdistancing.rds')) %>% 
  mutate(date = date_range_start %>% substr(1,10) %>% as.Date()) %>% 
  left_join(sj_blockgroups, by = c("origin_census_block_group" = "GEOID")) %>% 
  filter(!is.na(DISTRICTS))

# obtaining weekends
weekends <-
  sj_socialdistancing %>% 
  filter(!duplicated(date)) %>% 
  arrange(date) %>% 
  mutate(
    # as.Date() counts days since 1970-01-01, a Thursday, so a date's
    # day number %% 7 is 2 on Saturdays and 3 on Sundays
    weekend = (date %>% as.numeric()) %% 7 %in% c(2,3)
  ) %>% 
  dplyr::select(date,weekend)

sj_socialdistancing <- 
  sj_socialdistancing %>% 
  left_join(weekends)
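The weekend flag relies on a base-R detail worth verifying: `as.Date()` stores dates as days since 1970-01-01, which was a Thursday, so day numbers equal to 2 or 3 modulo 7 fall on Saturdays and Sundays. A quick standalone check:

```r
# days since 1970-01-01 (a Thursday): Thu=0, Fri=1, Sat=2, Sun=3, ... (mod 7)
d <- as.Date(c("2020-03-13", "2020-03-14", "2020-03-15", "2020-03-16"))
as.numeric(d) %% 7               # 1 2 3 4
as.numeric(d) %% 7 %in% c(2, 3)  # FALSE TRUE TRUE FALSE
# weekdays(d), in an English locale: "Friday" "Saturday" "Sunday" "Monday"
```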

# date of the shelter in place order
shelter_start <- "2020-03-16" %>% as.Date()

# get average staying at home on weekdays in January and February
sj_pre_sd_at_home_average <- sj_socialdistancing %>% 
  filter(weekend == F) %>% 
  filter(date <  as.Date("2020-03-01")) %>%
  group_by(origin_census_block_group) %>% 
  summarize(
    completely_home_device_count = sum(completely_home_device_count),
    device_count = sum(device_count)
  ) %>% 
  mutate(
    `% Completely at Home Pre Shelter` = (completely_home_device_count/device_count*100) %>% round(1),
    `% not completely at home pre shelter` = 100 - `% Completely at Home Pre Shelter`
  )

Curate the data

Extract the relevant variables from each dataset: the post-shelter share of devices away from home, plus blockgroup-level median income and vehicle counts from the ACS.

#Curate socialdistancing data
sj_outside <- sj_socialdistancing %>% 
  rename(blockgroup = origin_census_block_group) %>%
  filter(weekend == F) %>% 
  filter(date >  as.Date(shelter_start)) %>%
  group_by(blockgroup) %>% 
  summarize(
    completely_home_device_count = sum(completely_home_device_count),
    device_count = sum(device_count)
  ) %>% 
  mutate(perc_away = 100 - (completely_home_device_count/device_count*100) %>% round(1))

# load in income data - code adapted from other students
sj_median_income_bg <-
  getCensus(
    name = "acs/acs5",
    vintage = 2018,
    region = "block group:*", 
    regionin = "state:06+county:085",
    vars = "B19013_001E"
  ) %>%
  mutate(
    blockgroup =
      paste0(state,county,tract,block_group)
  ) %>%
  rename(
    median_income = B19013_001E
  )

# get data on vehicles available as vehicles allocation
sj_vehicles_bg <- getCensus(
    name = "acs/acs5",
    vintage = 2018,
    region = "block group:*", 
    regionin = "state:06+county:085",
    vars = "B992512_001E"
  ) %>% 
  mutate(
    blockgroup =
      paste0(state,county,tract,block_group)) %>%
  rename(
      total_vehicles = B992512_001E
  ) 

Put data together and normalize

This is how clustering works: if each data entry has three features, imagine plotting the entries in a three-dimensional space, with each dimension representing one feature. In our case, the three features are the percentage of devices away from home, median income, and number of vehicles. The clustering algorithm then groups points by their proximity in this space.
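Concretely, the proximity k-means uses is Euclidean distance in this three-dimensional feature space. A toy sketch with two hypothetical (made-up) blockgroup rows, already on a normalized scale:

```r
# two hypothetical blockgroups described by the three features
p1 <- c(perc_away = 1.5, median_income = 1.2, total_vehicles = 0.8)
p2 <- c(perc_away = 1.7, median_income = 1.0, total_vehicles = 0.9)

# Euclidean distance; kmeans minimizes the squared version within clusters
sqrt(sum((p1 - p2)^2))  # 0.3
```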

For this proximity to work correctly, the data needs to be normalized. The percentage of devices away from home is always between 0 and 100, while total vehicle counts can be far larger; without normalization, the larger-scale feature would dominate how clusters are formed. So we divide each feature by its standard deviation.
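A minimal sketch of why the scaling matters, using toy vectors rather than the real variables: after dividing by the standard deviation, every feature has unit spread, so no single feature dominates the distance.

```r
perc <- c(10, 20, 30)        # small-range feature
veh  <- c(1000, 3000, 8000)  # large-range feature

sd(perc / sd(perc))  # 1
sd(veh / sd(veh))    # 1
```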

sj_data puts all the data together.

norm_sj_data has all the normalized data.

sj_data <- sj_outside %>%
          select(c('blockgroup','perc_away')) %>%
          inner_join(sj_median_income_bg %>% select(c('median_income', 'blockgroup'))) %>%
          inner_join(sj_vehicles_bg %>% select(c('total_vehicles', 'blockgroup'))) %>%
          filter(median_income > 0 & total_vehicles > 0)
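The `inner_join` calls above keep only blockgroups that appear in all three tables. Base R's `merge` behaves the same way by default, which makes for a quick dependency-free illustration with made-up rows:

```r
# two toy tables sharing a key column; rows A and D have no match
a <- data.frame(blockgroup = c("A", "B", "C"), perc_away = c(10, 20, 30))
b <- data.frame(blockgroup = c("B", "C", "D"), median_income = c(50, 60, 70))

merge(a, b, by = "blockgroup")  # inner join: only B and C survive
```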

norm_sj_data <- sj_data %>%
                mutate(median_income = median_income/sd(median_income),
                       total_vehicles = total_vehicles/sd(total_vehicles),
                       perc_away = perc_away/sd(perc_away),
                       blockgroup = NULL) #The blockgroup needs to be removed, else it affects clustering.
                       
plot(norm_sj_data$median_income, norm_sj_data$perc_away)

In the plot above, observe the weak negative relationship between income and the percentage of devices away from home.

Choosing number of clusters

We are going to use the classical k-means algorithm for clustering. One key input is the number of clusters, which we cannot know beforehand. So we cluster for a range of cluster counts and measure the total ‘within sum of squares’ (withinss). For each cluster, withinss is the sum of squared distances from its points to the cluster centre; naturally, good clustering should make these values small.
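To make withinss concrete, here is a toy check (on made-up one-dimensional data) that `tot.withinss` equals the sum of squared distances from each point to its assigned cluster centre:

```r
set.seed(1)
x <- matrix(c(0, 0.1, 10, 10.1), ncol = 1)
km <- kmeans(x, centers = 2)

# recompute by hand: squared distance from each point to its cluster centre
manual <- sum((x - km$centers[km$cluster, ])^2)
all.equal(manual, km$tot.withinss)  # TRUE
```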

So we run the clustering for 1 to 15 clusters, sum the withinss values across clusters for each count, and look for the ‘elbow’: the point beyond which adding clusters yields only diminishing drops in total withinss.

set.seed(42)  # kmeans picks random starting centres; fix the seed for reproducibility
wss <- kmeans(norm_sj_data, centers = 1)$tot.withinss

# We take iteration 2 to 15
for (i in 2:15) wss[i] <- kmeans(norm_sj_data, centers = i)$tot.withinss

# We plot the 15 withinss values. One for each k
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

The elbow suggests that 5 is a good candidate. When in doubt, erring on the side of fewer clusters is usually the safer choice.

Now, we do the clustering with centers = 5. We use the factoextra package to visualize. In this visualization, we use the original sj_data, rather than the normalized version.

clusters <- kmeans(norm_sj_data,centers=5)
fviz_cluster(clusters, data = sj_data %>% select(-blockgroup))

Impact of clustering

Let us revisit the income to social distancing relationship within each cluster. In particular, the relationship seems to have strongly reversed for cluster 2.

sj_data$cluster <- clusters$cluster
sj_data_cluster1 <- sj_data %>% filter(cluster == 1)
sj_data_cluster2 <- sj_data %>% filter(cluster == 2)
sj_data_cluster3 <- sj_data %>% filter(cluster == 3)
sj_data_cluster4 <- sj_data %>% filter(cluster == 4)
sj_data_cluster5 <- sj_data %>% filter(cluster == 5)

plot(sj_data_cluster1$median_income, sj_data_cluster1$perc_away)

plot(sj_data_cluster2$median_income, sj_data_cluster2$perc_away)

plot(sj_data_cluster3$median_income, sj_data_cluster3$perc_away)

plot(sj_data_cluster4$median_income, sj_data_cluster4$perc_away)

plot(sj_data_cluster5$median_income, sj_data_cluster5$perc_away)

We can plot where these clusters are.

sj_blockgroups_clusters <- sj_blockgroups %>%
                           inner_join(sj_data[,c('blockgroup', 'cluster')], by = c('GEOID' = 'blockgroup'))

mapview(sj_blockgroups_clusters, zcol = 'cluster')

The problems with clustering

  1. You can never be sure how many clusters you should have. Generally, fewer is better.
  2. Clustering can be somewhat opaque: it does not tell you the logic behind the clusters, so you may have to work that out yourself (much as you have to interpret the coefficients after fitting a regression).
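One partial remedy for the second problem is to inspect the per-cluster feature means, which k-means returns in `$centers`. A toy sketch on made-up data:

```r
set.seed(1)
# two synthetic groups with opposite income/away-from-home profiles
toy <- data.frame(income = c(rnorm(20, 1), rnorm(20, 5)),
                  away   = c(rnorm(20, 4), rnorm(20, 1)))
km <- kmeans(toy, centers = 2)

km$centers  # one row per cluster: its mean on each feature, a rough "profile"
```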